covariate distribution
Maximum Risk Minimization with Random Forests
Freni, Francesco, Fries, Anya, Kühne, Linus, Reichstein, Markus, Peters, Jonas
We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of such works has proposed to minimize the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.
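The MaxRM principle in the abstract above can be illustrated with a minimal sketch. It is not the authors' random forest algorithm: it only evaluates the maximum risk of a fixed predictor over environments and searches constant predictors on a grid. The names `env_risk` and `max_risk` are hypothetical, and the formulas used for the negative reward (MSE minus the second moment of the response) and the regret (MSE minus a caller-supplied reference risk) are assumptions based on the abstract's one-line descriptions.

```python
def mse(ys, preds):
    """Mean squared error of predictions on one environment."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def env_risk(ys, preds, kind="mse", best_risk=0.0):
    """Risk on a single environment (formulas are assumptions, see lead-in)."""
    err = mse(ys, preds)
    if kind == "mse":
        return err
    if kind == "neg_reward":
        # Assumed form: MSE minus the second moment of the response.
        return err - sum(y * y for y in ys) / len(ys)
    if kind == "regret":
        # Assumed form: excess MSE over a reference risk supplied by the caller.
        return err - best_risk
    raise ValueError(f"unknown risk: {kind}")


def max_risk(envs, predict, kind="mse", best_risks=None):
    """Largest per-environment risk of `predict` over a list of (xs, ys) pairs."""
    risks = []
    for i, (xs, ys) in enumerate(envs):
        preds = [predict(x) for x in xs]
        best = best_risks[i] if best_risks is not None else 0.0
        risks.append(env_risk(ys, preds, kind, best))
    return max(risks)


# Toy data (made up): a large environment with response 0, a small one with response 2.
envs = [([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]), ([0.0], [2.0])]

# Search constant predictors on a grid for the smallest maximum risk.
grid = [k / 10 for k in range(21)]
best_c = min(grid, key=lambda c: max_risk(envs, lambda x: c))

pooled_c = 0.5  # mean of all four responses, i.e. the pooled least-squares constant
print(best_c, max_risk(envs, lambda x: best_c))
print(pooled_c, max_risk(envs, lambda x: pooled_c))
```

On this toy data the pooled constant favors the larger environment, while the grid search picks the constant that balances the two environments' errors.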
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- Europe > Switzerland (0.04)
DFW: A Novel Weighting Scheme for Covariate Balancing and Treatment Effect Estimation
Khan, Ahmad Saeed, Schaffernicht, Erik, Stork, Johannes Andreas
Estimating causal effects from observational data is challenging due to selection bias, which leads to imbalanced covariate distributions across treatment groups. Propensity score-based weighting methods are widely used to address this issue by reweighting samples to simulate a randomized controlled trial (RCT). However, the effectiveness of these methods heavily depends on the observed data and the accuracy of the propensity score estimator. For example, inverse propensity weighting (IPW) assigns weights based on the inverse of the propensity score, which can lead to unstable weights when propensity scores have high variance (due to either the data or model misspecification), ultimately degrading the ability to handle selection bias and estimate treatment effects. To overcome these limitations, we propose Deconfounding Factor Weighting (DFW), a novel propensity score-based approach that leverages the deconfounding factor to construct stable and effective sample weights. DFW prioritizes less confounded samples while mitigating the influence of highly confounded ones, producing a pseudopopulation that better approximates an RCT. Our approach ensures bounded weights, lower variance, and improved covariate balance. While DFW is formulated for binary treatments, it naturally extends to multi-treatment settings, as the deconfounding factor is computed based on the estimated probability of the treatment actually received by each sample. Through extensive experiments on real-world benchmark and synthetic datasets, we demonstrate that DFW outperforms existing methods, including IPW and CBPS, in both covariate balancing and treatment effect estimation.
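The instability of IPW weights mentioned in the abstract above can be seen in a small sketch. It shows only standard inverse propensity weighting with a normalized weighted difference in means. It does not implement DFW, whose deconfounding factor is not defined in the abstract. The function names and the toy numbers are hypothetical.

```python
def ipw_weights(treatment, propensity):
    """Inverse propensity weights: 1/e(x) for treated, 1/(1 - e(x)) for controls."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treatment, propensity)]


def weighted_ate(outcome, treatment, weights):
    """Normalized weighted difference in mean outcomes between the two groups."""
    def group_mean(group):
        num = sum(w * y for y, t, w in zip(outcome, treatment, weights) if t == group)
        den = sum(w for t, w in zip(treatment, weights) if t == group)
        return num / den
    return group_mean(1) - group_mean(0)


# Toy data (made up): the second unit was treated despite a propensity of 0.01.
treatment = [1, 1, 0, 0]
propensity = [0.5, 0.01, 0.5, 0.5]
outcome = [1.0, 5.0, 0.0, 0.0]

weights = ipw_weights(treatment, propensity)
print(weights)  # the second weight is roughly 100, the others are 2
print(weighted_ate(outcome, treatment, weights))
```

A single unit with an extreme propensity score receives about fifty times the weight of the others and dominates the estimate, which is the failure mode that bounded weighting schemes aim to avoid.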
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Sweden > Örebro County > Örebro (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Air (1.00)
- Health & Medicine > Therapeutic Area > Pediatrics/Neonatology (0.46)
Structure Maintained Representation Learning Neural Network for Causal Inference
Sun, Yang, Lu, Wenbin, Zhou, Yi-Hui
Recent developments in causal inference have greatly shifted the interest from estimating the average treatment effect to the individual treatment effect. In this article, we improve the predictive accuracy of representation learning and adversarial networks in estimating individual treatment effects by introducing a structure keeper which maintains the correlation between the baseline covariates and their corresponding representations in the high-dimensional space. We train a discriminator at the end of representation layers to trade off representation balance and information loss. We show that the proposed discriminator minimizes an upper bound of the treatment estimation error. We can address the tradeoff between distribution balance and information loss by considering the correlations between the learned representation space and the original covariate feature space. We conduct extensive experiments with simulated and real-world observational data to show that our proposed Structure Maintained Representation Learning (SMRL) algorithm outperforms state-of-the-art methods. We also demonstrate the algorithm on real electronic health record data from the MIMIC-III database.
- North America > United States > North Carolina > Wake County > Raleigh (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Technology > Medical Record (0.54)
Clustered random forests with correlated data for optimal estimation and inference under potential covariate shift
Young, Elliot H., Bühlmann, Peter
We develop Clustered Random Forests, a random forests algorithm for clustered data, arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions for each decision tree making up clustered random forests take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy. Clustered random forests are shown for certain tree splitting criteria to be minimax rate optimal for pointwise conditional mean estimation, while being computationally competitive with standard random forests. Further, we observe that the optimality of a clustered random forest, with regard to how the (population-level) optimal weights are chosen within this framework, i.e. those that minimise mean squared prediction error, varies under covariate distribution shift. In light of this, we advocate weight estimation to be determined by a user-chosen covariate distribution with respect to which optimal prediction or inference is desired. This highlights a key difference in behaviour between correlated and independent data with regard to nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
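To make the idea of a weighted least squares leaf prediction concrete, here is a minimal sketch of one simple special case. It assumes that all responses in a leaf share a common mean and variance and that observations within a cluster are equicorrelated with a known correlation `rho`. Under those assumptions the generalized least squares estimate of the mean weights each cluster mean by n / (1 + (n - 1) * rho). The function name `gls_leaf_mean` is hypothetical, and the paper's actual weighting scheme may differ.

```python
def gls_leaf_mean(clusters, rho):
    """Weighted least squares estimate of a common mean from clustered responses.

    Assumes equal variances and a shared within-cluster correlation `rho`.
    """
    num = 0.0
    den = 0.0
    for ys in clusters:
        n = len(ys)
        weight = n / (1.0 + (n - 1) * rho)  # effective sample size of the cluster
        num += weight * (sum(ys) / n)
        den += weight
    return num / den


# Toy leaf (made up): one cluster of four observations and one singleton cluster.
leaf = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(gls_leaf_mean(leaf, rho=0.0))  # independent data: the plain average of all five values
print(gls_leaf_mean(leaf, rho=1.0))  # perfectly correlated clusters: average of the two cluster means
```

As the within-cluster correlation grows, repeated observations from the same cluster carry less additional information, so the large cluster is down-weighted relative to the plain average.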
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples
Zhang, Thomas T., Lee, Bruce D., Ziemann, Ingvar, Pappas, George J., Matni, Nikolai
A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and statistically dependent within each source, violating vital assumptions in existing theoretical studies. Toward addressing these issues, we establish statistical guarantees for learning general $\textit{nonlinear}$ representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal F \times \mathcal G$, where $f_\star^{(t)}$ are task specific linear functions and $g_\star$ is a shared nonlinear representation. A representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. We show that when $N \gtrsim C_{\mathrm{dep}} (\mathrm{dim}(\mathcal F) + \mathrm{C}(\mathcal G)/T)$, the excess risk of $\hat f^{(0)} \circ \hat g$ on the target task decays as $\nu_{\mathrm{div}} \big(\frac{\mathrm{dim}(\mathcal F)}{N'} + \frac{\mathrm{C}(\mathcal G)}{N T} \big)$, where $C_{\mathrm{dep}}$ denotes the effect of data dependency, $\nu_{\mathrm{div}}$ denotes an (estimable) measure of $\textit{task-diversity}$ between the source and target tasks, and $\mathrm C(\mathcal G)$ denotes the complexity of the representation class $\mathcal G$. In particular, our analysis reveals: as the number of tasks $T$ increases, both the sample requirement and risk bound converge to that of $r$-dimensional regression as if $g_\star$ had been given, and the effect of dependency only enters the sample requirement, leaving the risk bound matching the iid setting.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Pennsylvania (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
An Empirical Examination of Balancing Strategy for Counterfactual Estimation on Time Series
Huang, Qiang, Meng, Chuizheng, Cao, Defu, Huang, Biwei, Chang, Yi, Liu, Yan
Counterfactual estimation from observations represents a critical endeavor in numerous application fields, such as healthcare and finance, with the primary challenge being the mitigation of treatment bias. The balancing strategy aimed at reducing covariate disparities between different treatment groups serves as a universal solution. However, when it comes to time series data, the effectiveness of balancing strategies remains an open question, with a thorough analysis of the robustness and applicability of balancing strategies still lacking. This paper revisits counterfactual estimation in the temporal setting and provides a brief overview of recent advancements in balancing strategies. More importantly, we conduct a critical empirical examination of the effectiveness of the balancing strategies within the realm of temporal counterfactual estimation in various settings on multiple datasets. Our findings could be of significant interest to researchers and practitioners and call for a reexamination of the balancing strategy in time series settings.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > China > Jilin Province (0.04)
- Overview (1.00)
- Research Report > New Finding (0.87)
- Research Report > Experimental Study (0.68)
Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck
Yang, Hao, Sun, Zexu, Xu, Hongteng, Chen, Xu
As a promising individualized treatment effect (ITE) estimation method, counterfactual regression (CFR) maps individuals' covariates to a latent space and predicts their counterfactual outcomes. However, the selection bias between control and treatment groups often imbalances the two groups' latent distributions and negatively impacts this method's performance. In this study, we revisit counterfactual regression through the lens of information bottleneck and propose a novel learning paradigm called Gromov-Wasserstein information bottleneck (GWIB). In this paradigm, we learn CFR by maximizing the mutual information between covariates' latent representations and outcomes while penalizing the kernelized mutual information between the latent representations and the covariates. We demonstrate that the upper bound of the penalty term can be implemented as a new regularizer consisting of $i)$ the fused Gromov-Wasserstein distance between the latent representations of different groups and $ii)$ the gap between the transport cost generated by the model and the cross-group Gromov-Wasserstein distance between the latent representations and the covariates. GWIB effectively learns the CFR model through alternating optimization, suppressing selection bias while avoiding trivial latent distributions. Experiments on ITE estimation tasks show that GWIB consistently outperforms state-of-the-art CFR methods. To promote the research community, we release our project at https://github.com/peteryang1031/Causal-GWIB.
- North America > United States (0.14)
- Asia > China (0.04)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.87)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Data Science (0.92)
Informativeness of Weighted Conformal Prediction
Ying, Mufang, Guo, Wenge, Khamaru, Koulik, Hung, Ying
Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, it is pointed out in this paper that the effectiveness of WCP heavily relies on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations.
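The dependence on overlap described in the abstract above can be sketched with the weighted quantile that underlies WCP. The sketch assumes the split-conformal form of the method: calibration scores are weighted by the likelihood ratio between test and training covariate distributions, the test point contributes a point mass at infinity, and the interval radius is the weighted (1 - alpha) quantile. The likelihood ratio weights are taken as known inputs, the function name is hypothetical, and the two methods proposed in the paper are not implemented here.

```python
import math


def weighted_conformal_quantile(scores, cal_weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores plus a mass at infinity.

    `cal_weights` and `test_weight` are assumed likelihood ratios
    (test density over training density) at the calibration and test covariates.
    """
    total = sum(cal_weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, cal_weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # The finite scores never reach level 1 - alpha: the interval is unbounded.
    return math.inf


# Toy calibration scores (made up), e.g. absolute residuals on held-out data.
scores = [1.0, 2.0, 3.0, 4.0]
uniform = [1.0, 1.0, 1.0, 1.0]

# Good overlap: the test point looks like the calibration points.
print(weighted_conformal_quantile(scores, uniform, test_weight=1.0, alpha=0.25))

# Poor overlap: the test point is far more likely under the test distribution.
print(weighted_conformal_quantile(scores, uniform, test_weight=10.0, alpha=0.25))
```

When the normalized weight of the test point exceeds alpha, the calibration scores cannot supply enough mass and the quantile is infinite, which gives a valid but uninformative prediction interval.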
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- North America > United States > New Jersey (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Asia > Middle East > Jordan (0.04)
Learning sources of variability from high-dimensional observational studies
Bridgeford, Eric W., Chung, Jaewon, Gilbert, Brian, Panda, Sambit, Li, Adam, Shen, Cencheng, Badea, Alexandra, Caffo, Brian, Vogelstein, Joshua T.
Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, most of these methods are limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Texas > Tarrant County > Arlington (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.67)
Trained Transformers Learn Linear Models In-Context
Zhang, Ruiqi, Frei, Spencer, Bartlett, Peter L.
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. Embedding a sequence of labeled training data and unlabeled test data as a prompt allows transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
- Europe > Denmark (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)